`torch`/`jax` Dataloader support #55

shenoynikhil · 2024-03-24T07:21:06Z

Currently, getting a torch_geometric DataLoader is complicated because it takes several manual steps. This PR addresses that by introducing the following:

A way to get torch tensors from __getitem__, similar to with_format in Hugging Face datasets (Link). __getitem__ can now return numpy (default), torch, or jax arrays.

For example,

from openqdc import Dummy

import numpy as np; ds = Dummy(); print (isinstance(ds[0]['positions'], np.ndarray)) # prints True
import torch; ds = Dummy(array_format="torch"); print (isinstance(ds[0]['positions'], torch.Tensor)) # prints True
import jax; ds = Dummy(array_format="jax"); print (isinstance(ds[0]['positions'], jax.numpy.ndarray)) # prints True
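
To illustrate the idea behind array_format, here is a minimal sketch of format dispatch. It is not the code in this PR: convert_record and the _CONVERTERS registry are hypothetical names, and the sketch only assumes that numpy is installed and that torch and jax are optional.

```python
import numpy as np


def _to_numpy(x):
    return np.asarray(x)


def _to_torch(x):
    import torch  # imported lazily so torch stays an optional dependency

    return torch.from_numpy(np.asarray(x))


def _to_jax(x):
    import jax.numpy as jnp  # imported lazily so jax stays an optional dependency

    return jnp.asarray(x)


# Hypothetical registry (an assumption of this sketch): format name -> converter.
_CONVERTERS = {"numpy": _to_numpy, "torch": _to_torch, "jax": _to_jax}


def convert_record(record, array_format="numpy"):
    """Return a copy of `record` with every value converted to `array_format`."""
    try:
        convert = _CONVERTERS[array_format]
    except KeyError:
        raise ValueError(f"unknown array_format: {array_format!r}") from None
    return {key: convert(value) for key, value in record.items()}


record = {"positions": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]], "atomic_numbers": [1, 9]}
converted = convert_record(record)  # default format is numpy
```

Keeping the torch and jax imports inside the converter functions means that a user who only needs numpy never pays for, or needs, the heavier packages.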

An option to pass a transform. When we want a torch_geometric Data object instead of the sklearn Bunch that __getitem__ returns by default, it is convenient to apply a function to the returned bunch.

from torch_geometric.data import Data
def custom_transform(bunch): 
    return Data(z=bunch.atomic_numbers, pos=bunch.positions, e=bunch.energies, f=bunch.forces)
ds = Dummy(array_format="torch", transform=custom_transform)
print (isinstance(ds[0], Data)) # prints True
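
For readers who want to see the transform hook in isolation, the following is a dependency-free sketch. TinyDataset and to_pair are hypothetical stand-ins, not openqdc classes, and SimpleNamespace plays the role of the Bunch.

```python
from types import SimpleNamespace


class TinyDataset:
    """Toy stand-in for a dataset with a transform hook (not the openqdc class)."""

    def __init__(self, records, transform=None):
        self._records = list(records)
        self._transform = transform

    def __len__(self):
        return len(self._records)

    def __getitem__(self, idx):
        # Build the attribute-style record first, then hand it to the
        # user-supplied transform if one was given.
        item = SimpleNamespace(**self._records[idx])
        return item if self._transform is None else self._transform(item)


def to_pair(bunch):
    # Example transform: reshape the record into whatever the consumer expects.
    return (bunch.atomic_numbers, bunch.positions)


records = [{"atomic_numbers": [1, 1], "positions": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]]}]
plain = TinyDataset(records)
paired = TinyDataset(records, transform=to_pair)
```

Applying the transform inside __getitem__ means any loader that indexes the dataset receives the transformed object without extra glue code.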

Last but not least, if you want a PyTorch Geometric dataloader, you can create one as shown below. I have not added an explicit method for this yet, because the changes above already reduce the effort considerably. If you still think it is valuable for users to get a dataloader directly, I can implement it.

Note: I looked through Hugging Face datasets; they do not have a method for getting a dataloader from their datasets either.

from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader is deprecated
def custom_transform(bunch): return Data(z=bunch.atomic_numbers, pos=bunch.positions, e=bunch.energies, f=bunch.forces)
ds = Dummy(array_format="torch", transform=custom_transform)
dl = DataLoader(ds, batch_size=4)
batch = next(iter(dl))

TODOs

Currently, I have only added tests for the array_format functionality. I will also add a test for the transform.

@prtos @FNTwin

Checklist:

Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added or an existing one is deleted.

FNTwin

I would also add a simple as_dataloader function to provide out-of-the-box utilities at this point.
Nevertheless, it is a good PR. I could nitpick or disagree a bit on the implementation, since I wanted to use dynamic inheritance based on the available packages to define the return object type automatically, but simpler is better down the line.
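
One possible shape for such a helper is sketched below. This is an assumption and not an API that exists in openqdc: as_dataloader, loader_cls, and loader_kwargs are hypothetical names. The default loader is torch.utils.data.DataLoader, imported lazily.

```python
def as_dataloader(dataset, batch_size=4, loader_cls=None, **loader_kwargs):
    """Hypothetical helper: wrap `dataset` in a dataloader.

    `loader_cls` defaults to torch.utils.data.DataLoader, imported lazily so
    that torch is only required when the default is actually used.
    """
    if loader_cls is None:
        from torch.utils.data import DataLoader as loader_cls
    return loader_cls(dataset, batch_size=batch_size, **loader_kwargs)
```

With a transform that returns torch_geometric Data objects, a caller could pass loader_cls=torch_geometric.loader.DataLoader to get graph batching instead of the default collation.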

openqdc/datasets/base.py

openqdc/datasets/statistics.py

FNTwin

All good. With the conversion to tensors, getting a dataset into a dataloader should be a one-liner (for real this time), but I still think we should have a dummy as_iter method that returns a default dataloader.
In any case, thank you for the bug fixes and the work! I'll probably try to clean up the conditional import of torch and jax on my end somehow, but the PR is 🔥
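
As a rough illustration of what a tidier conditional import could look like, the sketch below probes for optional backends without importing them. The names available_formats, check_format, and _OPTIONAL_BACKENDS are hypothetical, and the sketch assumes numpy is always installed.

```python
from importlib.util import find_spec

# Optional backends this sketch knows about: format name -> top-level module.
_OPTIONAL_BACKENDS = {"torch": "torch", "jax": "jax"}


def available_formats():
    """List the array formats usable in the current environment."""
    formats = ["numpy"]  # assumed to be a hard dependency
    for name, module in _OPTIONAL_BACKENDS.items():
        # find_spec returns None when a top-level module is not installed,
        # and it does not import the module itself.
        if find_spec(module) is not None:
            formats.append(name)
    return formats


def check_format(array_format):
    """Validate `array_format` and fail early with a clear message."""
    if array_format not in ("numpy", *_OPTIONAL_BACKENDS):
        raise ValueError(f"unknown array_format: {array_format!r}")
    if array_format not in available_formats():
        raise ImportError(f"array_format={array_format!r} requires an extra package")
    return array_format
```

Checking availability once, up front, lets the dataset report a missing package when it is constructed rather than on the first __getitem__ call.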

Nikhil Shenoy added 2 commits March 24, 2024 07:05

Added support to get torch or jax tensors

cbb494c

Added support to get torch or jax tensors

e22651f

shenoynikhil changed the base branch from main to develop March 24, 2024 07:21

Nikhil Shenoy added 2 commits March 24, 2024 17:28

undo change

0f59a04

Added transform functionality test

29929cf

shenoynikhil linked an issue Mar 24, 2024 that may be closed by this pull request

Getting Dataloaders Easily #60

Closed

shenoynikhil changed the title ~~[WIP] torch/jax Dataloader support~~ torch/jax Dataloader support Mar 25, 2024

shenoynikhil self-assigned this Mar 25, 2024

FNTwin reviewed Mar 25, 2024

FNTwin mentioned this pull request Mar 25, 2024

Descriptors and some other stuff WIP #67

Merged

Nikhil Shenoy added 6 commits March 25, 2024 15:17

Updated code based on comments

1c77666

Added list check from convert dict keys

327e363

test skipping if package noot present

5139a57

Added jax/tensor support to interaction datasets

fa141f5

removed redundant line

5197e32

minor change

d10b15b

shenoynikhil changed the base branch from develop to release April 3, 2024 17:44

Nikhil Shenoy added 6 commits April 4, 2024 16:11

Merge branch 'release' into dataloader

e331cc6

Updated array stuff for xyz dataset

1c10566

Merge branch 'release' into dataloader

f1769b3

fix bug during rebase and tests

ba22ee1

array test debug

ac593e3

undo test change and reset state

6f0d46f

shenoynikhil commented Apr 4, 2024

openqdc/datasets/statistics.py (outdated)

cleaner variant

a8d0016

FNTwin approved these changes Apr 4, 2024

shenoynikhil merged commit ac299a5 into release Apr 5, 2024
5 checks passed

shenoynikhil deleted the dataloader branch April 5, 2024 00:04
